feat: add ResourceRegistrar wiring, plugin-data mount, and upload auth#54
Merged
Conversation
3c6fcd7 to
e4a7b74
Compare
Signed-off-by: Noppanat Wadlom <[email protected]>
FLOWMESH_PLUGIN_DIR is deliberately read-only — it's the code mount — so plugins that need persistence (a SQLite ACL, a cache file) currently have to ask each operator to add a custom bind via docker-compose.override.yml, which is brittle. Add a second mount at /app/plugin-data driven by FLOWMESH_PLUGIN_DATA_DIR. A path value (the default, `./plugin-data`) is a host bind, auto-created on stack up; a bare name is an external Docker volume of that name, which the operator precreates and owns the lifecycle of. The CLI discriminates path vs bare name in apply_plugin_data_env and routes the value to either the service mount directly (path mode) or an internal FLOWMESH_PLUGIN_DATA_VOLUME parameter consumed by a parameterized external volume declaration (volume mode). Signed-off-by: Noppanat Wadlom <[email protected]>
The default external volume name was the bare alias `flowmesh_plugin_data`, which would collide across multiple stacks sharing a host (FLOWMESH_STACK_SUFFIX exists exactly to scope these names per-stack). Match the convention used by the other named volumes in the same file. Signed-off-by: Noppanat Wadlom <[email protected]>
`apply_plugin_data_env` was hooked into `_stack()._load`, which runs on every stack subcommand (`up`, `down`, `ps`, `logs`, `clean`, ...) via DockerComposeStack.run, so `ensure_dir(resolved)` materialized ./plugin-data in the operator's cwd even on read-only commands. FLOWMESH_PLUGIN_DIR avoids this by routing its mkdir through `ensure_deploy_paths`, which only runs when `to_deploy=True`. Split the helper: env-var routing (path resolution + alias re-route) stays in `apply_plugin_data_env` so compose substitution always has the right values; the `ensure_dir` call moves into `ensure_deploy_paths`, gated on `FLOWMESH_PLUGIN_DATA_VOLUME` being unset (volume mode is operator-managed, no host dir to create). Also drops the redundant "~/" prefix from _PLUGIN_DATA_PATH_PREFIXES (the trailing "~" already subsumes it) and adds five unit tests covering the empty / relative / absolute / tilde / bare-name branches. Signed-off-by: Noppanat Wadlom <[email protected]>
When the host restarts past a ResourceRegistrar's TTL, persistent plugins were silently dropping grants on long-lived workers and workflows. The reconcile sweep replaces TTL pruning with a stronger guarantee: on boot, enumerate every live resource (workflows + their tasks, nodes, workers) and hand the full batch to each registrar's `refresh`, then call `purge_stale` once per registrar. Registrars that persist state use the pair to drop records the server no longer tracks. - `auth.security.refresh_resources(refs, logger)` and `purge_stale_resources(logger)` fan out to every `ResourceRegistrar`. - `_lifespan` runs `_reconcile_resources()` once after plugins load and the system principal resolves. Worker / workflow enumeration is gated on `IS_ROOT_NODE`; nodes are always present. - docs/PLUGINS.md notes the new sweep on the `ResourceRegistrar` bullet. Requires lumid-hooks 0.2.0 for the new Protocol methods. The `tool.uv.sources` entry pointing at `../lumid.hooks` is temporary — drop it (and refresh the pinned requirements files) once 0.2.0 is on PyPI. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> Signed-off-by: Noppanat Wadlom <[email protected]>
Review findings on the startup-reconcile work: - Move `_reconcile_resources()` after `SUPERVISOR.start(...)` so this node is in `NODE_REGISTRY` by the time the sweep enumerates it. Previously the local node was systematically missing from the refresh batch. - `refresh_resources` now returns the set of registrar names whose `refresh` raised, and `purge_stale_resources(..., skip=...)` excludes them. A registrar that fails mid-sweep no longer wipes rows it never marked live; other registrars still complete normally. - `docs/PLUGINS.md` locates the sweep relative to plugin install / supervisor handshake and documents the host-side guard. - Style/ops: `TODO(lumid-hooks-0.2.0):` prefix on the `tool.uv.sources` comment so future cleanup surfaces in grep; `uv.lock` revision restored to 3 (was downgraded by an older local uv). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> Signed-off-by: Noppanat Wadlom <[email protected]>
Match lumid-hooks 0.2.0's single-method Protocol. The fan-out helper in `auth.security` becomes one call; the failed-set plumbing in `_lifespan` goes away because `reconcile` is atomic on the plugin side — a raised exception leaves the registrar's store unchanged, so the host can log and move on. - `auth.security.reconcile_resources(resources, logger)` replaces `refresh_resources` / `purge_stale_resources`. Each registrar's `reconcile` is wrapped in try/except + `logger.exception`; the sweep continues with the next registrar on failure. - `_lifespan` calls `reconcile_resources(refs, logger)` once and drops the failed-set threading. - Tests trade the old refresh/purge dispatch tests for one happy-path fan-out test and one that asserts a failing registrar is isolated while the rest still run. - docs/PLUGINS.md reflects the single-call shape and the atomic-or-rollback semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> Signed-off-by: Noppanat Wadlom <[email protected]>
lumid-hooks 0.2.0 (with the new `ResourceRegistrar.reconcile`) is released on PyPI, so the editable path override from `[tool.uv.sources]` is dropped. flowmesh-hook's pin moves to >=0.2.0 since it re-exports the breaking-changed Protocol surface. Server requirements regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> Signed-off-by: Noppanat Wadlom <[email protected]>
`_reconcile_resources` walked `NODE_REGISTRY` and `WORKER_REGISTRY` but read `.node_id` / `.worker_id` off the records — those attributes don't exist on the `Node` / `Worker` Pydantic models, both of which expose `.id`. The mismatch crashed the FastAPI lifespan during the startup reconcile sweep, taking the whole server down on every boot. Caught while running the lumid plugin e2e plan against a freshly built `FLOWMESH_VERSION=permission` image. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> Signed-off-by: Noppanat Wadlom <[email protected]>
Signed-off-by: Noppanat Wadlom <[email protected]>
3d73a65 to
446730f
Compare
`ingest_result`, `upload_result_file`, and `upload_task_trace` authenticated the caller but ran no authorization, so any valid token could write or overwrite a task's result, artifacts, or trace files. Gate all three with a kind-level `require_permission(RESULT, None, WRITE)`. The check is kind-level because the uploader is an operator/worker key that holds no per-task grant; a concrete check would resolve to the owning task's grant and deny legitimate uploads. A no-op when no PermissionChecker is registered, so default behavior is unchanged. Adds deny-path tests for each endpoint. Signed-off-by: Noppanat Wadlom <[email protected]>
Drop the "(registrar plugins)" parenthetical; the lines below already explain why the sweep runs after the supervisor handshake. Signed-off-by: Noppanat Wadlom <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
FLOWMESH_PLUGIN_DIRis read-only — it's the code mount — so plugins that need writable persistence (a SQLite ACL, a cache, an ownership table) have nowhere to write. This adds a second, writable mount at/app/plugin-data, driven by a singleFLOWMESH_PLUGIN_DATA_DIRenv var. The immediate consumer islumid-flowmesh-plugin'sResourceRegistrar/PermissionCheckerACL.It also closes an authorization gap on the result and trace upload endpoints: they authenticated the caller but ran no authorization, so any valid token could overwrite another task's results, artifacts, or trace files. They now require a kind-level
RESULT/WRITEpermission. This is a no-op when noPermissionCheckeris registered, so default behavior is unchanged; it pairs with theflowmesh:results:writescope in the matchinglumid-flowmesh-pluginPR.Changes
cli/stack/.../assets/compose.yml,env_schema.py,.env.example,utils.py,stack.py: add the/app/plugin-datamount and routeFLOWMESH_PLUGIN_DATA_DIRto either a host bind or an external Docker volume.src/server/routers/v1/results.py,traces.py: gateingest_result,upload_result_file, andupload_task_traceon kind-levelRESULT/WRITE, with deny-path tests.docs/ENV.md,docs/PLUGINS.md,hook/README.md: document the three mount modes; fold in a pre-existingHookBindings/BaseBindingscorrection.Design
FLOWMESH_PLUGIN_DATA_DIRis one env var with three modes, discriminated by the same rule docker compose uses for short-syntax mount sources (inapply_plugin_data_env): a path (relative resolves against the operator's cwd, or absolute) becomes a host bind auto-created onstack up; a bare name becomes an external volume the operator precreates withdocker volume create. Plugin code stays read-only; plugin data is writable and separately mounted.The upload gate is kind-level (
RESULT,None,WRITE) rather than per-task because the uploader is the worker/operator key, which holds no per-task grant — authorization is by scope, not ownership.Test Plan
docker compose configacross all threeFLOWMESH_PLUGIN_DATA_DIRmodes (empty/path default, custom path, bare volume name).flowmesh stack up+ CPU worker) against a locally built image carrying this branch, using a logging-only dummy plugin to observe FlowMesh's hook wiring against the new mount.Test Result
docker compose configproduced the expectedbind/volumeshapes for all three modes.End-to-end (passed):
reconcile_resourcesonce at boot with the full live enumeration, andregister/require_permission/resolve_accessible_idsat the expected endpoints./app/plugin-data, survives astack down/stack up, and the startup reconcile correctly refreshes live grants and prunes stale ones.Pre-submission Checklist
pre-commit run --all-filesand fixed any issues.uv run pytest tests/passes locally.uv sync --all-packages --group ci --frozen).[BREAKING]and described migration steps above.